Chapter 16 - Exercises¶

In [1]:
library(tidyverse)
library(bayesrules)
library(bayesplot)
library(rstan)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(forcats)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This is bayesplot version 1.10.0

- Online documentation and vignettes at mc-stan.org/bayesplot

- bayesplot theme set to bayesplot::theme_default()

   * Does _not_ affect other ggplot2 plots

   * See ?bayesplot_theme_set for details on theme setting

Loading required package: StanHeaders

rstan (Version 2.21.8, GitRev: 2e1f913d3ca3)

For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)


Attaching package: ‘rstan’


The following object is masked from ‘package:tidyr’:

    extract


Loading required package: Rcpp

This is rstanarm version 2.21.4

- See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!

- Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.

- For execution on a local, multicore CPU with excess RAM we recommend calling

  options(mc.cores = parallel::detectCores())


Attaching package: ‘rstanarm’


The following object is masked from ‘package:rstan’:

    loo


Exercise 16.6¶

In [2]:
big_word_club <- big_word_club %>% 
  filter(treat == 1) %>% 
  select(school_id, score_pct_change) %>% 
  na.omit()

head( big_word_club )
A data.frame: 6 × 2
  school_id  score_pct_change
  <fct>      <dbl>
1 2           -8.823529
2 2           14.285714
3 2           17.857143
4 2            6.666667
5 2           12.500000
7 2           23.076923

a)¶

In [3]:
big_word_club %>% 
    distinct( school_id ) %>% 
    nrow()
26

26 schools.

b)¶

In [4]:
big_word_club %>% 
    group_by( school_id ) %>% 
    summarize( size=n() ) %>% 
    arrange( size ) %>% 
    filter(row_number()==1 | row_number()==n())
A tibble: 2 × 2
school_id  size
<fct>      <int>
2          12
10         17

Each school has between 12 and 17 students in the project.

c)¶

In [5]:
# reorder school_ids
big_word_club <- big_word_club  %>% 
    mutate( school_id = fct_reorder( school_id, score_pct_change, .fun='mean' ) )

school_changes <- big_word_club %>% 
    group_by( school_id ) %>% 
    summarize( mean_pct_change=mean(score_pct_change) )
In [6]:
options(repr.plot.width=15, repr.plot.height=5)
ggplot( school_changes %>% arrange( mean_pct_change ), aes( x=school_id, y=mean_pct_change ) ) + geom_point()

School 6 has the lowest change, school 17 the highest.

d)¶

Density plot:

In [7]:
ggplot( big_word_club ) + geom_density( aes(x=score_pct_change, color=school_id) )

Box plot:

In [8]:
ggplot( big_word_club ) + geom_boxplot( aes(y=score_pct_change, x=school_id) )

It appears that the variability of students within a school (within-group variability) is quite large, at least as large as the between-group variability.

Exercise 16.7¶

a)¶

Not all possible schools were in the program. We might want to predict percent changes for schools not yet in the dataset, so school_id should be treated as a grouping variable. Moreover, which school a student is enrolled in might be an important factor, since some schools may perform better than others. Complete pooling would ignore school-level differences and predict the same mean percent change for all students, with a correspondingly broad variance. In contrast, no pooling would treat each school in isolation and overfit to it, even if it contributed only a small number of participants.
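
The three candidate structures can be sketched as model formulas (illustrative only; the actual fitting with stan_glmer() happens in Exercise 16.8):

```r
# Candidate model structures for score_pct_change (sketch only):
complete_pooling <- score_pct_change ~ 1               # one global mean for all students
no_pooling       <- score_pct_change ~ school_id       # an independent mean per school
partial_pooling  <- score_pct_change ~ (1 | school_id) # hierarchical: school means share a prior
```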

b)¶

$\mu$: the global average percent change; $\mu_j$: the average percent change for school $j$.

c)¶

$\sigma_y$: standard deviation of percent changes within a school (within-group variability); $\sigma_\mu$: standard deviation of the school-level averages $\mu_j$ around $\mu$ (between-group variability).
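
Putting the pieces together, the hierarchical model (matching the stan_glmer specification fitted in Exercise 16.8, with the concrete priors left implicit) can be written as:

```latex
\begin{aligned}
Y_{ij} \mid \mu_j, \sigma_y &\sim N(\mu_j, \sigma_y^2) && \text{percent change of student } i \text{ in school } j \\
\mu_j \mid \mu, \sigma_\mu &\sim N(\mu, \sigma_\mu^2)  && \text{school-level means} \\
\mu,\ \sigma_y,\ \sigma_\mu  &\sim \text{weakly informative priors}
\end{aligned}
```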

Exercise 16.8¶

a)¶

In [9]:
big_words_hierarchical <- stan_glmer(
  score_pct_change ~ (1 | school_id), 
  data = big_word_club, family = gaussian,
  prior_intercept = normal(0, 1, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 4e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.4 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 1.31066 seconds (Warm-up)
Chain 1:                1.94236 seconds (Sampling)
Chain 1:                3.25301 seconds (Total)
Chain 1: 

(Sampling output for Chains 2-4 is analogous and omitted.)

b)¶

MCMC diagnostics:

In [10]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(big_words_hierarchical)
mcmc_dens_overlay(big_words_hierarchical)
mcmc_acf(big_words_hierarchical)
neff_ratio(big_words_hierarchical)
rhat(big_words_hierarchical)
Warning message:
“The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the bayesplot package.
  Please report the issue at <https://github.com/stan-dev/bayesplot/issues/>.”
(Intercept)
0.7766
b[(Intercept) school_id:6]
0.62275
b[(Intercept) school_id:34]
0.64765
b[(Intercept) school_id:14]
0.67435
b[(Intercept) school_id:41]
0.8125
b[(Intercept) school_id:37]
0.78965
b[(Intercept) school_id:44]
0.995
b[(Intercept) school_id:10]
0.97345
b[(Intercept) school_id:7]
1.1128
b[(Intercept) school_id:43]
0.9876
b[(Intercept) school_id:21]
1.10075
b[(Intercept) school_id:32]
1.0612
b[(Intercept) school_id:22]
1.16885
b[(Intercept) school_id:29]
1.30355
b[(Intercept) school_id:16]
1.24895
b[(Intercept) school_id:25]
1.24195
b[(Intercept) school_id:11]
1.2501
b[(Intercept) school_id:38]
1.21305
b[(Intercept) school_id:4]
1.1441
b[(Intercept) school_id:39]
1.10055
b[(Intercept) school_id:28]
0.9889
b[(Intercept) school_id:18]
0.88975
b[(Intercept) school_id:8]
0.74145
b[(Intercept) school_id:9]
0.7059
b[(Intercept) school_id:3]
0.6555
b[(Intercept) school_id:2]
0.626
b[(Intercept) school_id:17]
0.59995
sigma
1.0634
Sigma[school_id:(Intercept),(Intercept)]
0.36505
(Intercept)
0.999976008794019
b[(Intercept) school_id:6]
0.999988353846605
b[(Intercept) school_id:34]
0.999925957493955
b[(Intercept) school_id:14]
0.999901952066752
b[(Intercept) school_id:41]
0.999849528936406
b[(Intercept) school_id:37]
0.999936806685064
b[(Intercept) school_id:44]
0.99991468891112
b[(Intercept) school_id:10]
0.999907868830651
b[(Intercept) school_id:7]
0.999836169412077
b[(Intercept) school_id:43]
0.999969710168226
b[(Intercept) school_id:21]
0.999989547587096
b[(Intercept) school_id:32]
1.00001065660817
b[(Intercept) school_id:22]
0.999925388884616
b[(Intercept) school_id:29]
1.00005445699669
b[(Intercept) school_id:16]
0.999964077047965
b[(Intercept) school_id:25]
1.00007016764975
b[(Intercept) school_id:11]
1.00009686726748
b[(Intercept) school_id:38]
0.999933598403967
b[(Intercept) school_id:4]
0.99988626582246
b[(Intercept) school_id:39]
0.999874355668055
b[(Intercept) school_id:28]
1.0000257796991
b[(Intercept) school_id:18]
1.00002658520685
b[(Intercept) school_id:8]
1.00008666823393
b[(Intercept) school_id:9]
1.00001063437738
b[(Intercept) school_id:3]
0.999939743179723
b[(Intercept) school_id:2]
1.00005677888947
b[(Intercept) school_id:17]
1.00025613401639
sigma
0.999877300485468
Sigma[school_id:(Intercept),(Intercept)]
1.0001126685321

The chains look converged and mixing is reasonably fast.

c)¶

In [11]:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(big_words_hierarchical) + xlab("score_pct_change")

Looks more or less OK: the data distribution is not perfectly Gaussian, but approximately so.

Exercise 16.9¶

a)¶

In [12]:
tidy(big_words_hierarchical, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 1 × 5
term         estimate  std.error  conf.low  conf.high
<chr>        <dbl>     <dbl>      <dbl>     <dbl>
(Intercept)  6.279869  1.05056    4.167856  8.415107

With 95% probability, the global average percent change lies between 4.2% and 8.4%.

b)¶

Yes: the 95% credible interval does not include zero. Of course, a comparison with a control group would be needed to establish that the program is the causal factor.

c)¶

In [13]:
sigma_table <- tidy(big_words_hierarchical, effects = "ran_pars")
sigma_table
A tibble: 2 × 3
term                      group      estimate
<chr>                     <chr>      <dbl>
sd_(Intercept).school_id  school_id   2.841094
sd_Observation.Residual   Residual   16.944405

$\sigma_y > \sigma_\mu$

In [14]:
sigma_mu <- sigma_table$estimate[[1]]
sigma_y <- sigma_table$estimate[[2]]

Part explained by within-group variation:

In [15]:
sigma_y^2 / (sigma_mu^2 + sigma_y^2)
0.97265503311967

Part explained by between-group variation:

In [16]:
sigma_mu^2 / (sigma_mu^2 + sigma_y^2)
0.0273449668803298

Almost all of the variance is explained by the large within-group variation, so grouping by school does not reduce the overall variance much. In general, there appear to be no substantial differences between schools.

Exercise 16.10¶

Confidence intervals:

In [17]:
school_summary <- tidy(big_words_hierarchical, effects = "ran_vals", conf.int = TRUE, conf.level = 0.80)
head( school_summary )
A tibble: 6 × 7
level  group      term         estimate    std.error  conf.low   conf.high
<chr>  <chr>      <chr>        <dbl>       <dbl>      <dbl>      <dbl>
6      school_id  (Intercept)  -1.1573179  2.057899   -5.365002  0.7743921
34     school_id  (Intercept)  -1.1071884  2.028525   -5.281117  0.8561878
14     school_id  (Intercept)  -1.1017874  2.004915   -5.292467  0.8616252
41     school_id  (Intercept)  -0.8137951  1.865341   -4.708564  1.1790100
37     school_id  (Intercept)  -0.7818682  1.831196   -4.578975  1.1599582
44     school_id  (Intercept)  -0.4706073  1.710459   -3.922703  1.5944033

Extract values directly from posterior simulation:

In [18]:
school_chains <- big_words_hierarchical %>%
  spread_draws(`(Intercept)`, b[,school_id]) %>% 
  mutate(mu_j = `(Intercept)` + b) 

head( school_chains )
A grouped_df: 6 × 7
.chain  .iteration  .draw  (Intercept)  school_id     b           mu_j
<int>   <int>       <int>  <dbl>        <chr>         <dbl>       <dbl>
1       1           1      4.387914     school_id:10  -0.3475687  4.040345
1       1           1      4.387914     school_id:11   0.9049150  5.292829
1       1           1      4.387914     school_id:14   0.1307815  4.518695
1       1           1      4.387914     school_id:16  -1.3284611  3.059453
1       1           1      4.387914     school_id:17   4.5041983  8.892112
1       1           1      4.387914     school_id:18   1.2748540  5.662768
In [19]:
school_summary_scaled <- school_chains %>% 
  select(-`(Intercept)`, -b) %>% 
  mean_qi(.width = 0.80) %>% 
  mutate(school_id = fct_reorder(school_id, mu_j))

head( school_summary_scaled )
A tibble: 6 × 7
school_id     mu_j      .lower     .upper     .width  .point  .interval
<fct>         <dbl>     <dbl>      <dbl>      <dbl>   <chr>   <chr>
school_id:10  5.437570  2.4592220   8.080118  0.8     mean    qi
school_id:11  6.619242  3.7894297   9.640071  0.8     mean    qi
school_id:14  4.574334  0.8913668   7.517379  0.8     mean    qi
school_id:16  6.191805  3.3454124   9.060449  0.8     mean    qi
school_id:17  8.272652  5.2266488  12.211449  0.8     mean    qi
school_id:18  7.357190  4.5755219  10.684811  0.8     mean    qi
In [20]:
ggplot(school_summary_scaled, 
       aes(x = school_id, y = mu_j, ymin = .lower, ymax = .upper)) +
  geom_pointrange() +
  xaxis_text(angle = 90, hjust = 1) +
  ylim(0, 13)

It is clearly visible that the within-school variability is large; given the small number of students per school, the values of $\mu_j$ are hard to constrain.

b)¶

In [21]:
school_chains %>% 
    filter( school_id=="school_id:10") %>% 
    summarize( conf.low=quantile(mu_j, 0.1), conf.high=quantile(mu_j, 0.9))
A tibble: 1 × 3
school_id     conf.low  conf.high
<chr>         <dbl>     <dbl>
school_id:10  2.459222  8.080118

$\mu_{10}$ lies between 2.5% and 8.1% with a probability of 80%.

c)¶

$H_0$: On average, vocabulary scores at School 10 improved by $\leq 5\%$.

$H_1$: On average, vocabulary scores at School 10 improved by $> 5\%$.

In [22]:
school_chains %>% 
    filter( school_id=="school_id:10") %>% 
    summarize( pH0=mean(mu_j<=5), pH1=mean(mu_j>5), odds=pH1/pH0 )
A tibble: 1 × 4
school_id     pH0      pH1      odds
<chr>         <dbl>    <dbl>    <dbl>
school_id:10  0.38115  0.61885  1.623639

The odds are 1.6:1 that, on average, vocabulary scores at School 10 improved by $> 5\%$. I would not consider this overwhelming evidence.

Exercise 16.11¶

a)¶

School 6:

In [23]:
big_words_hierarchical_df <- data.frame( big_words_hierarchical )
school6_chains <- big_words_hierarchical_df %>% 
    rename(b = b..Intercept..school_id.6.) %>% 
    select(X.Intercept., b, sigma ) %>% 
    mutate( mu_6 = X.Intercept. + b ) %>% 
    mutate( y_new_6 = rnorm(nrow(big_words_hierarchical_df), mean=mu_6, sd=sigma) )

head( school6_chains$y_new_6 )
  1. 8.51463295616551
  2. 16.9726452245404
  3. 15.5450973135622
  4. 6.97332230111473
  5. 15.6025235727268
  6. -5.22853272400229

Bayes Prep School:

In [24]:
bayes_prep_chains <- big_words_hierarchical_df %>%
    mutate( sigma_mu = sqrt(Sigma.school_id..Intercept...Intercept..)) %>% 
    select( mu=X.Intercept., sigma_y=sigma, sigma_mu ) %>% 
    mutate( mu_new=rnorm(nrow(big_words_hierarchical_df), mean=mu, sd=sigma_mu) ) %>% 
    mutate( y_new=rnorm(nrow(big_words_hierarchical_df), mean=mu_new, sd=sigma_y) )

head( bayes_prep_chains$y_new )
  1. 6.62052528261848
  2. 13.1087255778045
  3. 19.8158028587075
  4. 32.4797471883542
  5. -7.72999270909935
  6. -14.4687833040498

b)¶

In [25]:
school6_chains %>% 
  mean_qi(y_new_6, .width = 0.80)
A tibble: 1 × 6
y_new_6   .lower     .upper    .width  .point  .interval
<dbl>     <dbl>      <dbl>     <dbl>   <chr>   <chr>
4.572323  -17.18265  26.34574  0.8     mean    qi
In [26]:
bayes_prep_chains %>% 
  mean_qi(y_new, .width = 0.80)
A tibble: 1 × 6
y_new     .lower     .upper    .width  .point  .interval
<dbl>     <dbl>      <dbl>     <dbl>   <chr>   <chr>
6.422143  -15.49832  28.31004  0.8     mean    qi

The 80% credible intervals for the percent change of a new student are very similar ([-17.2, 26.3] for School 6 and [-15.5, 28.3] for the new Bayes Prep school). This makes sense: as elaborated above, knowing the school does not help much in reducing variability, since most of the variance is within-group.

These are the standard deviations for School 6 and for the full dataset; they are practically identical:

In [27]:
big_word_club %>% 
    filter( school_id==6 ) %>% 
    summarize( count=n(), mean=mean(score_pct_change), sd=sd(score_pct_change) )
A data.frame: 1 × 3
count  mean       sd
<int>  <dbl>      <dbl>
13     -1.660463  16.31915
In [28]:
big_word_club %>% 
    summarize( count=n(), mean=mean(score_pct_change), sd=sd(score_pct_change) )
A data.frame: 1 × 3
count  mean      sd
<int>  <dbl>     <dbl>
336    6.266634  17.08279

c)¶

In [29]:
prediction_shortcut <- posterior_predict(
    big_words_hierarchical,
    newdata = data.frame(school_id = c("6", "17", "Bayes Prep"))
)

head( prediction_shortcut )
A matrix: 6 × 3 of type dbl
1           2          3
 19.360207  28.410183   -2.117322
-14.678166  20.143041  -19.841212
 -9.707441  12.514682  -13.162466
  8.076675   1.585082   20.618833
 10.987653   4.382616    1.962025
 20.454708  14.510902   -3.756015
In [30]:
mcmc_areas(prediction_shortcut, prob = 0.8) + ggplot2::scale_y_discrete(labels = c("6", "17", "Bayes Prep"))
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

As already explained in b), the intervals are very similar and knowledge of the school id does not help much in reducing the predictive variance.

d)¶

In [31]:
school_means <- big_word_club %>% 
  group_by(school_id) %>% 
  summarize(count = n(), score_pct_change = mean(score_pct_change))
predictions_hierarchical <- posterior_predict(big_words_hierarchical, newdata = school_means)
head( predictions_hierarchical )
A matrix: 6 × 26 of type dbl
(26 columns of posterior predictive draws, one per school; the flattened notebook output is omitted here.)
In [32]:
ppc_intervals(school_means$score_pct_change, yrep = predictions_hierarchical, prob_outer = 0.80) +
  ggplot2::scale_x_continuous(labels = school_means$school_id, breaks = 1:nrow(school_means)) +
  xaxis_text(angle = 90, hjust = 1) + 
  geom_hline(yintercept = 6.3, linetype = "dashed")

All prediction intervals are extremely similar, as before.

Exercise 16.12¶

There is a lot of shrinkage in this analysis: all the posterior intervals are pulled towards the global mean. With weakly informative priors, the posterior mean predictions are roughly the weighted averages

$$\mu_j \approx \frac{\frac{n_j}{\sigma_y^2}\,\bar{y}_j + \frac{1}{\sigma_\mu^2}\,\bar{y}_\text{global}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\mu^2}}$$

Since $\sigma_y^2 \gg \sigma_\mu^2$, the posterior means are pulled towards $\bar{y}_\text{global}$. This makes sense: given the large within-group variation, we should not focus too much on individual schools.
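
Plugging in the posterior estimates from Exercise 16.9 ($\sigma_y \approx 16.94$, $\sigma_\mu \approx 2.84$) shows just how strong this pull is; a quick sketch, using School 6's size of 13 students as an example:

```r
# Weight that the weighted average places on the global mean, using the
# posterior sd estimates from Exercise 16.9 and School 6's n_j = 13.
sigma_y  <- 16.944405  # within-school sd
sigma_mu <- 2.841094   # between-school sd
n_j      <- 13

w_global <- (1 / sigma_mu^2) / (n_j / sigma_y^2 + 1 / sigma_mu^2)
round(w_global, 2)  # ~0.73: roughly 73% of the weight goes to the global mean
```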

Exercise 16.13¶

Exploratory Analysis¶

In [33]:
voices <- voices %>% 
    select( subject, pitch ) %>% 
    na.omit

head( voices )
A tibble: 6 × 2
subject  pitch
<fct>    <dbl>
A        229.7
A        237.3
A        236.8
A        251.0
A        267.0
A        266.0
A266.0
In [34]:
voices %>% 
    group_by( subject ) %>% 
    summarize( experiments=n(), mean_pitch=mean(pitch), sd_pitch=sd(pitch) )
A tibble: 6 × 4
subject  experiments  mean_pitch  sd_pitch
<fct>    <int>        <dbl>       <dbl>
A        14           250.7357    29.37261
B        13           145.9769    41.47626
C        14           232.0357    39.27762
D        14           102.1786    12.46553
E        14           258.1857    31.35594
F        14           168.9786    21.04421

Density plot:

In [35]:
ggplot( voices ) + geom_density( aes(x=pitch, color=subject) )

Box plot:

In [36]:
ggplot( voices ) + geom_boxplot( aes(y=pitch, x=subject) )

The within-group variation appears to be smaller than the between-group variation.

Posterior simulation¶

In [37]:
voices_hierarchical <- stan_glmer(
  pitch ~ (1 | subject), 
  data = voices, family = gaussian,
  prior_intercept = normal(0, 1, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 3.9e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.39 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 3.50277 seconds (Warm-up)
Chain 1:                2.77692 seconds (Sampling)
Chain 1:                6.27969 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 2.4e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 4.62456 seconds (Warm-up)
Chain 2:                3.62207 seconds (Sampling)
Chain 2:                8.24663 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2.4e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 2.97591 seconds (Warm-up)
Chain 3:                3.06427 seconds (Sampling)
Chain 3:                6.04018 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 2.4e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 2.31587 seconds (Warm-up)
Chain 4:                2.5324 seconds (Sampling)
Chain 4:                4.84827 seconds (Total)
Chain 4: 

MCMC diagnostics:

In [38]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(voices_hierarchical)
mcmc_dens_overlay(voices_hierarchical)
mcmc_acf(voices_hierarchical)
neff_ratio(voices_hierarchical)
rhat(voices_hierarchical)

Looks good!

Posterior predictive check¶

In [39]:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(voices_hierarchical) + xlab("pitch")

Looks very reasonable! Even the bimodality is modeled well.

Parameter distributions¶

global mean pitch:

In [40]:
tidy(voices_hierarchical, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 1 × 5
term         estimate  std.error  conf.low  conf.high
<chr>        <dbl>     <dbl>      <dbl>     <dbl>
(Intercept)  165.1566  27.81147   88.14315  212.8491

standard deviations:

In [41]:
sigma_table <- tidy(voices_hierarchical, effects = "ran_pars")
sigma_table
A tibble: 2 × 3
term                     group     estimate
<chr>                    <chr>     <dbl>
sd_(Intercept).subject   subject   73.57285
sd_Observation.Residual  Residual  31.35868

$\sigma_\mu > \sigma_y$

In [42]:
sigma_mu <- sigma_table$estimate[[1]]
sigma_y <- sigma_table$estimate[[2]]

Part explained by within-group variation:

In [43]:
sigma_y^2 / (sigma_mu^2 + sigma_y^2)
0.153739206272041

Part explained by between-group variation:

In [44]:
sigma_mu^2 / (sigma_mu^2 + sigma_y^2)
0.846260793727959

In contrast to the previous example, the dominant part is explained by between-group variation, as suspected from the exploratory data analysis.

Posterior credible intervals for mean subject pitches:

In [45]:
voice_chains <- voices_hierarchical %>%
  spread_draws(`(Intercept)`, b[,subject]) %>% 
  mutate(mu_j = `(Intercept)` + b) 

voice_summary_scaled <- voice_chains %>% 
  select(-`(Intercept)`, -b) %>% 
  mean_qi(.width = 0.80) %>% 
  mutate(subject = fct_reorder(subject, mu_j))

ggplot(voice_summary_scaled, 
       aes(x = subject, y = mu_j, ymin = .lower, ymax = .upper)) +
  geom_pointrange() +
  xaxis_text(angle = 90, hjust = 1)

It is very evident that even though a person's pitch changes across situations, each person retains their characteristic pitch, and pitches vary much more between persons than within a person.

Predictions for all subjects¶

In [46]:
voice_means <- voices %>% 
  group_by(subject) %>% 
  summarize(count = n(), pitch = mean(pitch))
predictions_hierarchical <- posterior_predict(voices_hierarchical, newdata = voice_means)
head( predictions_hierarchical )
A matrix: 6 × 6 of type dbl
1         2          3         4          5         6
243.0565  166.22962  285.3676  104.80708  251.0994  190.0720
270.1476   94.92541  249.8387  111.23562  236.8221  171.5580
213.6584  143.08184  138.1591   73.15103  216.6668  234.9504
298.2836  158.77370  240.2255  109.94794  208.7477  166.8353
333.0936  188.66417  210.2554  134.38662  239.5945  188.5028
268.6697  147.59585  265.2764  166.21684  265.2950  164.7258
In [47]:
ppc_intervals(voice_means$pitch, yrep = predictions_hierarchical, prob_outer = 0.80) +
  ggplot2::scale_x_continuous(labels = voice_means$subject, breaks = 1:nrow(voice_means)) +
  xaxis_text(angle = 90, hjust = 1) + 
  geom_hline(yintercept = 165.2, linetype = "dashed")

Contrary to Exercise 16.12, there is now almost no shrinkage. Since the between-group variation is much larger than the within-group variation, knowing the group explains a lot of the variability in the data.
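
The weighted-average view from Exercise 16.12 makes this concrete; a quick sketch, plugging in the posterior sd estimates above and a typical subject size of 14 recordings:

```r
# Weight on the global mean for a typical subject with 14 recordings,
# using the posterior sd estimates reported above.
sigma_y  <- 31.35868  # within-subject sd
sigma_mu <- 73.57285  # between-subject sd
n_j      <- 14

w_global <- (1 / sigma_mu^2) / (n_j / sigma_y^2 + 1 / sigma_mu^2)
round(w_global, 3)  # ~0.013: subject means are barely pulled toward the global mean
```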